Abstract. Logic programming is sometimes described as relational
programming: a paradigm in which the programmer specifies and composes
n-ary relations using systems of constraints. An advanced logic
programming environment will provide tools that abstract these relations
to transform, optimise, or even verify the correctness of a logic program.
This talk will show that these concepts, namely relations, constraints
and abstractions, turn out to also be important in the reverse engineering
process that underpins the discovery of bugs within the security industry.



1 Introduction 

Logic programming is a wonderful paradigm: it is wonderfully expressive and 
yet also comes equipped with some wonderfully elegant semantics. One legacy of
the foundational work on semantics by pioneers such as Kowalski, Levi and van
Emden is the suite of tools that we build and deploy within the field of research
that we refer to as logic programming environments. Partial evaluators, program
specialisation tools, and various program analyses are all formulated in terms 
of the base semantics proposed by these pioneers. These base semantics provide 
a way to judge the correctness of a program manipulation technique, and by 
applying abstraction methods, we can even synthesise program analyses from 
these base semantics in a systematic and principled way [5]. Abstraction is a 
powerful idea in program manipulation, but when coupled with the pantheon of 
semantics that exist in logic programming, the concept becomes doubly powerful: 
we just need to select a suitably expressive semantics and then abstract it in an 
appropriate way. These ideas and these tools are so much a part of our heritage 
that we rarely give this rich corpus of work a second thought.

The richness of the tooling that is available in logic programming becomes 
more evident when it is compared against the tooling that is available in re- 
verse engineering. Reverse engineering is the discipline of extracting information 
from a program when the source is unavailable. Reverse engineering (abbreviated
to reversing in the security sector) is routinely applied when performing
a security audit on a commercial product that relies on software developed by 
a third-party such as a library. Security engineers also reverse to reason about 
the latest malicious programs and devise antivirus software. Reversing is also 
necessary when auditing programs for vulnerabilities that are introduced by the
compilation process itself, or are most evident at the level of the executable. 

The most popular tool that is used for reversing within the security community
is the IDA Pro disassembler [11]. This disassembler divides an executable
into (more or less) its basic blocks, presenting them visually to the engineer in 
a flow diagram. Needless to say, the major impediment to reversing is the enor- 
mous effort required to understand an executable even when it is presented as 
a flow diagram. As researchers in programming environments, we are conscious 
that tool support can underpin the development of a new program, and aid the 
understanding of an existing program. The problem of extracting information 
from a program — which is the very essence of reversing — is not new to us. We 
recognise it as the problem of discovering invariants in a program. The problem 
is more about how to migrate techniques from higher-level paradigms to the level 
of an executable. In this short paper, we shall show how the familiar ideas from 
logic programming — relations, constraints, and abstractions of constraints — 
can be reinterpreted and reapplied in the setting of reverse engineering. 

2 Where are the relations? 

The place to start has to be the base semantics. In assembler, the problem is 
not that the semantics is ambiguous (as in some languages); the problem is more
one of granularity. Instructions perform bit-wise operations on words rather than
merely arithmetical and logical operations on variables. The foci of computation 
are words, bit sequences and control-flags. Moreover, these objects are referenced 
through pointers and pointer offsets rather than as local variables and as, say, 
elements of an array. These semantics can be modelled, at least partially, by 
using relations. The idea is to exploit the finite nature of machine words and 
model each word or register as a vector of bits. The before and after states of 
each instruction can then be represented as a relation between the bits of the 
input and output vectors. Such a relation can be described propositionally as a 
Boolean formula over the propositional variables in the input and output vectors. 
This approach of modelling is colloquially referred to as "bit-blasting" within 
the model-checking community, presumably because of the explosive nature of 
the technique. Bit-blasting was famously used within the CBMC tool [3] in 
which the loops of C programs are unwound to a fixed depth so as to search for 
violations against prescribed correctness properties. Although initially treated 
with some skepticism, bit-blasting has gained acceptance as SAT solvers have 
emerged that can check the satisfiability of very large formulae [12] and SMT
solvers have been developed that include bit vector theories [3] that directly 
support word-level instructions. 

Bounded model checking has been successfully applied to check invariants, 
and find circumstances in which invariants are violated, but it cannot extract a 
hitherto unknown invariant from a program. Nevertheless, the relational nature 
of bit-blasting does provide a base semantics that is compositional. To see this, 
consider a sequence of just two instructions that both add the constant one to the 
same 32-bit register. The input and output relation for this increment operation
could be expressed as a Boolean formula $f$ over the bits in the input and
output vectors $\langle r_0, \ldots, r_{31} \rangle$ and $\langle r'_0, \ldots, r'_{31} \rangle$, where $r_i$ and $r'_i$ are the
variables that express the state of bit $i$ in the register before and after the
increment. To compose two increments, two formulae $f_1$ and $f_2$ are obtained from
$f$ by, respectively, systematically renaming the $r'_i$ variables to $r''_i$, and renaming
the $r_i$ variables to $r''_i$. The conjunction $f_1 \wedge f_2$ then asserts a double increment on
the vectors $\langle r_0, \ldots, r_{31} \rangle$ and $\langle r'_0, \ldots, r'_{31} \rangle$, albeit using a vector of temporary
variables $\langle r''_0, \ldots, r''_{31} \rangle$. (The $\langle r''_0, \ldots, r''_{31} \rangle$ variables can be removed from the
formula $f_1 \wedge f_2$ without loss of information by applying existential quantifier
elimination. This can give a denser representation of the composed semantics
though it is not strictly necessary.) By iterating this composition technique, it is
possible to derive the relational semantics for a sequence of instructions of
arbitrary length.
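
The composition step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word is narrowed to 4 bits so that the relation can be enumerated explicitly (a real tool would hand the formula to a SAT or SMT solver), and the names `increment_formula`, `models` and `compose` are hypothetical helpers introduced here, not part of any tool mentioned in the text.

```python
from itertools import product

WIDTH = 4  # assumption: a narrow word so the relation can be enumerated


def increment_formula(r, r_out):
    """Bit-blasted increment: r_out[i] <-> r[i] XOR carry_i.

    The carry into bit 0 is 1 (the added constant); the carry into bit i
    is r[0] AND ... AND r[i-1].
    """
    carry = 1
    for i in range(WIDTH):
        if r_out[i] != (r[i] ^ carry):
            return False
        carry &= r[i]
    return True


def models(formula):
    """The relation denoted by a formula: all satisfying (input, output) pairs."""
    vectors = list(product((0, 1), repeat=WIDTH))
    return {(a, b) for a in vectors for b in vectors if formula(a, b)}


def compose(f1, f2):
    """Conjoin f1 and f2 on a shared intermediate vector, then eliminate it.

    Matching b1 == b2 plays the role of the renaming to the temporary
    variables; dropping b from the result is existential elimination.
    """
    return {(a, c) for (a, b1) in f1 for (b2, c) in f2 if b1 == b2}


def to_int(vector):
    """Read a little-endian bit vector as an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(vector))


f = models(increment_formula)
double_increment = compose(f, f)
```

Under these assumptions every pair in `double_increment` relates a value to that value plus two, modulo 16, which is the behaviour the conjunction of the two renamed formulae is meant to capture.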

3 Where are the constraints? 

One important idea in the analysis of logic programs is to use systems of con- 
straints to describe systems of constraints 0: systems of arbitrary Herbrand 
equations might be described by equations that are limited to depth-k; systems
of finite domain constraints might be described by conjunctions of Horn formulae
that express definiteness dependencies [1]. We can reinterpret this idea
for Boolean formulae and use formulae in one class to describe those in a more 
expressive class. Alternatively formulae could be described by systems of linear 
constraints. Using linear constraints as descriptions for formulae is more natu- 
ral than one would initially think for reversing. When formulae are derived by 
bit-blasting and composition, the relationship between input and output vectors
often resembles a system of simple linear constraints. For instance, in the case of
the double increment, the formula $f_1 \wedge f_2$ could be described by the constraint
$2 + \sum_{i=0}^{31} 2^i r_i \equiv \sum_{i=0}^{31} 2^i r'_i \pmod{2^{32}}$. The constraint is not actually linear but is a
congruence constraint with a modulus of $2^{32}$, which reflects the bounded nature of
arithmetic that is expressed by the formula $f_1 \wedge f_2$. Describing the function with
the linear relationship $2 + \sum_{i=0}^{31} 2^i r_i = \sum_{i=0}^{31} 2^i r'_i$ would actually misrepresent
$f_1 \wedge f_2$. This is because, if the register initially stored the value $2^{32} - 1$, then
after the double increment, the register will contain 1 and not $2^{32} + 1$. Thus the
relationship is only linear on a sub-range of the input data values. Congruence
constraints are natural abstractions for reversing because they are already famil- 
iar to the reverse engineer. This is because a number of security vulnerabilities 
relate to moduli; such vulnerabilities typically arise because the programmer 
has overlooked the wrap-around nature of arithmetic. Moreover, a security en- 
gineer will pay close attention to the size of an operand when reconstructing an 
algorithm from an executable. 
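
The difference between the linear reading and the congruence can be made concrete with a small Python check. This is a sketch under the assumption that the register is 32 bits wide, as in the running example; the function names are illustrative only.

```python
MODULUS = 2 ** 32  # 32-bit register, as in the running example


def double_increment(value):
    """Two increments on a 32-bit register; the arithmetic wraps around."""
    return (value + 2) % MODULUS


def linear_holds(before, after):
    """The naive linear reading: 2 + before = after over the integers."""
    return 2 + before == after


def congruence_holds(before, after):
    """The congruence reading: 2 + before = after modulo 2**32."""
    return (2 + before - after) % MODULUS == 0


samples = [0, 1, 41, MODULUS - 3, MODULUS - 2, MODULUS - 1]
for before in samples:
    after = double_increment(before)
    print(before, after, linear_holds(before, after), congruence_holds(before, after))
```

The linear relationship fails exactly for the two largest inputs, where the addition wraps, while the congruence holds for every sample.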

An astute reader (and certainly a reverse engineer) will recall that a word
can either be interpreted as a signed or an unsigned value. The congruence
$2 + \sum_{i=0}^{31} 2^i r_i \equiv \sum_{i=0}^{31} 2^i r'_i \pmod{2^{32}}$ stems from an unsigned treatment; otherwise the
congruence would be $2 - 2^{31} r_{31} + \sum_{i=0}^{30} 2^i r_i \equiv -2^{31} r'_{31} + \sum_{i=0}^{30} 2^i r'_i \pmod{2^{32}}$



4 Andy King 



where $r_{31}$ and $r'_{31}$ are the sign bits. However, observe that by adding
$2^{32} r_{31} + 2^{32} r'_{31}$ to both sides, and noting that multiples of $2^{32}$ vanish modulo
$2^{32}$, the congruence reduces to $2 + \sum_{i=0}^{31} 2^i r_i \equiv \sum_{i=0}^{31} 2^i r'_i \pmod{2^{32}}$.
Thus the same congruence conveniently describes both the signed and unsigned
interpretation of words.
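
The claim that one congruence serves both interpretations can be spot-checked numerically. The following Python sketch assumes 32-bit two's complement words and tests a handful of boundary values plus some random ones; it is a sanity check rather than a proof, and the helper names are hypothetical.

```python
import random

WIDTH = 32
MODULUS = 2 ** WIDTH


def bits(value):
    """Little-endian list of the WIDTH bits of value."""
    return [(value >> i) & 1 for i in range(WIDTH)]


def unsigned(b):
    """Unsigned reading: every bit i carries weight 2**i."""
    return sum(2 ** i * bit for i, bit in enumerate(b))


def signed(b):
    """Two's complement reading: the top bit carries weight -2**(WIDTH-1)."""
    return -2 ** (WIDTH - 1) * b[-1] + sum(2 ** i * bit for i, bit in enumerate(b[:-1]))


def congruent(lhs, rhs):
    return (lhs - rhs) % MODULUS == 0


random.seed(0)  # fixed seed so the check is repeatable
values = [0, 1, 2 ** 31 - 2, 2 ** 31 - 1, 2 ** 31, MODULUS - 2, MODULUS - 1]
values += [random.randrange(MODULUS) for _ in range(1000)]

for value in values:
    r, r_out = bits(value), bits((value + 2) % MODULUS)
    assert congruent(2 + unsigned(r), unsigned(r_out))
    assert congruent(2 + signed(r), signed(r_out))
    # The two readings differ by 2**32 times the sign bit, hence agree modulo 2**32.
    assert congruent(signed(r), unsigned(r))
```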

Congruences reflect the bounded nature of computer arithmetic, but an
equation such as $2 + \sum_{i=0}^{31} 2^i r_i \equiv \sum_{i=0}^{31} 2^i r'_i \pmod{2^{32}}$ possesses solutions for the
variables $\langle r_0, \ldots, r_{31} \rangle$ and $\langle r'_0, \ldots, r'_{31} \rangle$ that are not 0-1 (truth) values. For instance,
the congruence is satisfied by the assignment
$\{r_0 \mapsto 2, r_1 \mapsto 0, \ldots, r_{31} \mapsto 0, r'_0 \mapsto 0, r'_1 \mapsto 2, r'_2 \mapsto 0, \ldots, r'_{31} \mapsto 0\}$.
Such an assignment has no clear relationship with a Boolean function: a Boolean
function is characterised by its set of assignments to 0-1 values. It is therefore
necessary to be clear as to how a Boolean function can be described by a
congruence. Formally, this is the role of the concretisation map: the
concretisation for a system of congruences is the Boolean function whose
satisfying assignments constitute the 0-1 solutions of the system (any solution
that assigns a value other than 0 or 1 is simply ignored in this interpretation of
a congruence).
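
A minimal Python sketch of this reading of concretisation follows. It assumes a 4-bit word rather than a 32-bit one so that the 0-1 assignments can be enumerated, and `satisfies` and `concretise` are names invented for the illustration.

```python
from itertools import product

WIDTH = 4  # assumption: narrow word so that all 0-1 assignments can be listed
MODULUS = 2 ** WIDTH


def satisfies(r, r_out):
    """The double-increment congruence, for arbitrary integer-valued variables."""
    lhs = 2 + sum(2 ** i * x for i, x in enumerate(r))
    rhs = sum(2 ** i * x for i, x in enumerate(r_out))
    return (lhs - rhs) % MODULUS == 0


def concretise():
    """Keep only the 0-1 solutions: these are the models of the described function."""
    vectors = list(product((0, 1), repeat=WIDTH))
    return {(r, r_out) for r in vectors for r_out in vectors if satisfies(r, r_out)}


# A non-Boolean solution of the same shape as the one in the text:
# r_0 -> 2 and r'_1 -> 2, every other variable -> 0.
non_boolean = ((2, 0, 0, 0), (0, 2, 0, 0))

gamma = concretise()
```

Here `non_boolean` satisfies the congruence but is not a member of `gamma`, which contains one pair per 4-bit input value.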

4 Where are the abstractions? 

Stating the concretisation map (or dually an abstraction map) is much like 
providing a specification of a problem. Realising an algorithm that satisfies the 
specification and thus solves the problem is another thing entirely. Superficially 
it would seem that Boolean formulae and congruences are not closely related, 
and therefore it is not obvious how to find a system of congruences that best 
describes a given Boolean function. However, this problem has recently been solved
using an iterative algorithm [5]. The force of this result is that it gives a way
to describe the relational semantics of an instruction, or even a sequence of 
instructions, with a system of congruences: bit-blasting is first used to derive 
a formula for the sequence and then this formula is described by congruences. 
Then the invariants on the basic blocks can be derived by fixpoint techniques 
[10] that have been proposed for imperative programs. To illustrate these ideas,
we return to reasoning about a double increment. For expositional purposes, we 
will suppose that a word is merely 4 bits wide. Then bit-blasting could derive 
the following system of (implicitly conjoined) formulae: 

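\[
\begin{array}{ll}
r'_0 \leftrightarrow \neg r_0 & r''_0 \leftrightarrow \neg r'_0 \\
r'_1 \leftrightarrow r_1 \oplus r_0 & r''_1 \leftrightarrow r'_1 \oplus r'_0 \\
r'_2 \leftrightarrow r_2 \oplus (r_0 \wedge r_1) & r''_2 \leftrightarrow r'_2 \oplus (r'_0 \wedge r'_1) \\
r'_3 \leftrightarrow r_3 \oplus (r_0 \wedge r_1 \wedge r_2) & r''_3 \leftrightarrow r'_3 \oplus (r'_0 \wedge r'_1 \wedge r'_2)
\end{array}
\]

Note that the formula contains the intermediate variables $\langle r'_0, r'_1, r'_2, r'_3 \rangle$ which
could be eliminated to derive a (possibly smaller) formula that still relates the
input and output vectors $\langle r_0, r_1, r_2, r_3 \rangle$ and $\langle r''_0, r''_1, r''_2, r''_3 \rangle$. (In this example,
unlike in Section 2, the primed variables are the intermediates and the
double-primed variables are the outputs.)

A congruent description is derived for $f$ by first searching for a satisfying
assignment (model) of $f$. This can be readily accomplished with a SAT solver. One
such assignment is $M_1$, which is given below as a 0-1 vector, where the propositional
variables are ordered as follows: $\langle r_0, r_1, r_2, r_3, r'_0, r'_1, r'_2, r'_3, r''_0, r''_1, r''_2, r''_3 \rangle$.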

\[
\begin{array}{lcl}
M_1 & = & (1,0,0,0,0,1,0,0,1,1,0,0) \\
M_2 & = & (1,0,0,1,0,1,0,1,1,1,0,1) \\
M_3 & = & (1,0,1,0,0,1,1,0,1,1,1,0) \\
 & \vdots & \\
M_{10} & = & (1,1,1,1,0,0,0,0,1,0,0,0)
\end{array}
\]

The truth assignment $M_1$ can be reinterpreted as the system of congruences
$S_1$. For instance, the single assignment $r_0 \mapsto 1$ gives rise to the single
congruence $r_0 \equiv 1 \pmod{2^4}$. Henceforth, for brevity, we omit the modulus, which in this
circumstance is chosen to be $2^4 = 16$ since words are 4 bits wide.



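\[
\begin{array}{lcl}
S_1 & = & \left\{ \begin{array}{l}
  r_0 \equiv 1,\; r_1 \equiv 0,\; r_2 \equiv 0,\; r_3 \equiv 0,\; r'_0 \equiv 0,\; r'_1 \equiv 1, \\
  r'_2 \equiv 0,\; r'_3 \equiv 0,\; r''_0 \equiv 1,\; r''_1 \equiv 1,\; r''_2 \equiv 0,\; r''_3 \equiv 0
  \end{array} \right\} \\[2ex]
S_2 & = & \left\{ \begin{array}{l}
  r_0 \equiv 1,\; r_1 \equiv 0,\; r_2 \equiv 0,\; r_3 \equiv r'_3,\; r'_0 \equiv 0,\; r'_1 \equiv 1, \\
  r'_2 \equiv 0,\; r'_3 \equiv r''_3,\; r''_0 \equiv 1,\; r''_1 \equiv 1,\; r''_2 \equiv 0
  \end{array} \right\} \\[2ex]
S_3 & = & \left\{ \begin{array}{l}
  r_0 \equiv 1,\; r_1 \equiv 0,\; r_2 \equiv r'_2,\; r_3 \equiv r'_3,\; r'_0 \equiv 0,\; r'_1 \equiv 1, \\
  r'_2 \equiv r''_2,\; r'_3 \equiv r''_3,\; r''_0 \equiv 1,\; r''_1 \equiv 1
  \end{array} \right\} \\[2ex]
 & \vdots & \\[1ex]
S_{10} & = & \left\{ \begin{array}{l}
  r_0 + r'_0 \equiv 1,\; r_1 + r''_1 \equiv 1, \\
  4r_2 + 4r''_2 + 4 \equiv 8r_3 + 4r'_0 + 4r'_1 + 8r'_2 + 8r''_3, \\
  r'_0 + r''_0 \equiv 1, \\
  2r'_1 + 4r'_2 + 2 \equiv 8r'_3 + 2r''_0 + 2r''_1 + 4r''_2 + 8r''_3
  \end{array} \right\}
\end{array}
\]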



The algorithm proceeds by searching for an assignment of $f$ that is not
described by the system $S_1$. This gives the model $M_2$ which can be translated
into another system of simple congruences $S'_2$. The system $S_2$ is then derived
from $S_1$ and $S'_2$ by computing the merge of $S_1$ and $S'_2$. This is the unique smallest
system that contains all the solutions of $S_1$ and $S'_2$. This operation is not
dissimilar to the affine hull that is used to merge systems of linear equations [7].
With $S_2$ in place, the algorithm continues by searching for a model $M_3$ of $f$ that
does not satisfy $S_2$. Translating $M_3$ as a system of congruences gives $S'_3$ which is
then merged with $S_2$ to give $S_3$ that is also given in the table. This iterative scheme
continues until $S_{10}$ is derived. All the models of $f$ are contained in $S_{10}$ and thus
the algorithm stops at this point.
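
The shape of this iteration can be sketched in Python, with two deliberate simplifications that should be read as assumptions rather than as the algorithm of the paper. First, the search for a model that escapes the current description is done by brute-force enumeration instead of a SAT solver. Second, the merge is the ordinary affine hull over the rationals (the operation the text calls "not dissimilar"), kept as a set of generator points, rather than the merge of congruence systems modulo 16. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

WIDTH = 4  # 4-bit words, as in the running example


def bits(value):
    """Little-endian bit tuple of value modulo 2**WIDTH."""
    return tuple((value >> i) & 1 for i in range(WIDTH))


def to_int(vector):
    return sum(bit << i for i, bit in enumerate(vector))


def is_model(a):
    """Stand-in for the bit-blasted formula: r' = r + 1 and r'' = r' + 1."""
    r, mid, out = a[:WIDTH], a[WIDTH:2 * WIDTH], a[2 * WIDTH:]
    return mid == bits(to_int(r) + 1) and out == bits(to_int(mid) + 1)


def difference(point, origin):
    return [Fraction(p - o) for p, o in zip(point, origin)]


def reduce_against(vec, basis):
    """Cancel the pivot entries of vec using the echelon rows, in insertion order."""
    for pivot, row in basis:
        coeff = vec[pivot]
        if coeff:
            vec = [a - coeff * b for a, b in zip(vec, row)]
    return vec


def described(point, origin, basis):
    """True if point lies in the affine hull generated so far."""
    return not any(reduce_against(difference(point, origin), basis))


def infer_hull(num_vars):
    """Repeatedly find a model outside the current hull and merge it in."""
    origin, basis, generators = None, [], []
    for candidate in product((0, 1), repeat=num_vars):  # replaces the SAT search
        if not is_model(candidate):
            continue
        if origin is None:
            origin = candidate
            generators.append(candidate)
            continue
        residue = reduce_against(difference(candidate, origin), basis)
        if not any(residue):
            continue  # already described, so not a witness for refinement
        pivot = next(i for i, x in enumerate(residue) if x)
        basis.append((pivot, [x / residue[pivot] for x in residue]))
        generators.append(candidate)
    return origin, basis, generators


origin, basis, generators = infer_hull(3 * WIDTH)
all_models = [a for a in product((0, 1), repeat=3 * WIDTH) if is_model(a)]
```

In this simplified setting the loop needs far fewer merges than there are models, which mirrors the point of the iterative scheme: each new model either is already described or strictly enlarges the description.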

The system $S_{10}$ contains relational information pertaining to the intermediate
bits as well as the input and output bits. The intermediate bits can be eliminated 
by applying a triangular form [7] which makes explicit any hidden relationships 
between the input and output bits: 

$r_0 \equiv r''_0, \quad r_1 + r''_1 \equiv 1, \quad 2r_1 + 4r_2 + 8r_3 + 2 \equiv 2r''_1 + 4r''_2 + 8r''_3$

Interestingly, the relationships derived are richer than one would expect. We 
have inferred that the states of the low bits are not changed by the double 
increment; that the states of the bits in position one always change; and that 
upper bits differ by two. 
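
These three relations between the input and output bits can be confirmed exhaustively for 4-bit words. The Python sketch below does so by enumeration; the list `s` stands for the output bits after the double increment, and the function name is an invention for this illustration.

```python
WIDTH = 4
MODULUS = 2 ** WIDTH


def violations():
    """Return the inputs, if any, for which one of the three relations fails."""
    failures = []
    for value in range(MODULUS):
        result = (value + 2) % MODULUS  # the double increment with wrap-around
        r = [(value >> i) & 1 for i in range(WIDTH)]   # input bits
        s = [(result >> i) & 1 for i in range(WIDTH)]  # output bits
        low_bit_unchanged = r[0] == s[0]
        bit_one_flips = r[1] + s[1] == 1
        upper_differ_by_two = (
            2 * r[1] + 4 * r[2] + 8 * r[3] + 2
            - (2 * s[1] + 4 * s[2] + 8 * s[3])
        ) % MODULUS == 0
        if not (low_bit_unchanged and bit_one_flips and upper_differ_by_two):
            failures.append(value)
    return failures


print(violations())
```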

5 Related work

It has recently been pointed out that even recovering the control-flow graph is 
more complicated than one would initially expect [S] and, in fact, that IDA Pro 
often fails to reconstruct the complete control-flow graph. The problem stems 
in part from indirect calls, that is, when the address of a function is stored to a 
memory location pointed to by a register. The technical problem that must be
solved is to reason about how intermediate instructions can alter the value
stored in the register, and thereby infer that the address remained unchanged
when the indirect call is resolved [5].

One notable body of work that also aims to support reversing is the thesis
work of Balakrishnan [5]. Balakrishnan, under the direction of Reps, has devel- 
oped a so-called value set analysis that attempts to uniformly track addresses 
and numeric values. They intelligently chose a simple form of modulo constraint 
to represent a non-continuous range of values. For example, in their notation 4[0, 
12] denotes the set {0, 4, 8, 12} that describes the sets {0, 8} and {8, 12} among 
others. The rationale for this approach is that it enables sets of addresses on 
some word alignment to be accurately represented. We consider this approach 
to be a major advance in the analysis of binaries, since it attempts to seamlessly 
support addresses and numeric values. 
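
The notation can be illustrated with a small Python sketch. It models only what the example above states, namely that a stride together with a lower and upper bound denotes a set of evenly spaced values; the class `StridedInterval` and the function `abstract` are hypothetical names and make no claim to reflect how value set analysis is actually implemented.

```python
from dataclasses import dataclass
from functools import reduce
from math import gcd


@dataclass(frozen=True)
class StridedInterval:
    """Illustrative reading of s[low, high]: low, low + s, ..., up to high."""
    stride: int
    low: int
    high: int

    def values(self):
        if self.stride == 0:  # a single value
            return {self.low}
        return set(range(self.low, self.high + 1, self.stride))

    def describes(self, concrete):
        """A set is described if all of its elements are denoted."""
        return set(concrete) <= self.values()


def abstract(concrete):
    """Tightest strided interval for a non-empty set of integers (a sketch)."""
    low, high = min(concrete), max(concrete)
    stride = reduce(gcd, (v - low for v in concrete), 0)
    return StridedInterval(stride, low, high)


aligned = StridedInterval(4, 0, 12)
print(sorted(aligned.values()))
print(aligned.describes({0, 8}), aligned.describes({8, 12}))
```

The stride is what lets a description keep alignment information that a plain interval such as [0, 12] would lose.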